DETAILED ACTION

1. Claims 1-7, 9-11, 13, 15-22, and 24-35 have been examined.



Papers Submitted

2. It is hereby acknowledged that the following papers have been received and placed of
record in the file: RCE as received on 12/20/2004.



Drawings

3. The drawings are objected to because it is not clear why applicant's amended Fig. 9
illustrates a method which is functionally different than the Fig. 9 filed on July 20, 2004. In the
old Fig. 9, if the clean load matched any tag, then it was checked to see if the load matching tag
and store valid bit are 0. However, in the most recent Fig. 9, the checking of the load matching tag
and store valid bit being 0 is done when the clean load does not match any tag. The examiner
would like applicant to confirm which Fig. 9 is correct. Corrected drawing sheets in compliance
with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the 
application. Any amended replacement drawing sheet should include all of the figures appearing 
on the immediate prior version of the sheet, even if only one figure is being amended. The figure 
or figure number of an amended drawing should not be labeled as "amended." If a drawing 
figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and 
where necessary, the remaining figures must be renumbered and appropriate changes made to the 
brief description of the several views of the drawings for consistency. Additional replacement 
sheets may be necessary to show the renumbering of the remaining figures. The replacement 



Application/Control Number: 09/896,526 Page 3 

Art Unit: 2183 

sheet(s) should be labeled "Replacement Sheet" in the page header (as per 37 CFR 1.84(c)) so as 
not to obstruct any portion of the drawing figures. If the changes are not accepted by the 
examiner, the applicant will be notified and informed of any required corrective action in the 
next Office action. The objection to the drawings will not be held in abeyance. 

Claim Objections 

4. Claim 1 is objected to because of the following informalities: Please replace "A
apparatus" with --An apparatus--. Appropriate correction is required.

Claim Rejections - 35 USC §103 

5. The following is a quotation of 35 U.S.C. 103(a) which forms the basis for all 
obviousness rejections set forth in this Office action: 

(a) A patent may not be obtained though the invention is not identically disclosed or described as set forth in 
section 102 of this title, if the differences between the subject matter sought to be patented and the prior art are 
such that the subject matter as a whole would have been obvious at the time the invention was made to a person 
having ordinary skill in the art to which said subject matter pertains. Patentability shall not be negatived by the 
manner in which the invention was made. 

6. Claims 1-7, 9-10, and 29-35 are rejected under 35 U.S.C. 103(a) as being unpatentable 
over Sundaramoorthy et al, "Slipstream Processors: Improving both Performance and Fault 
Tolerance," ASPLOS, Nov. 2000 (as applied in the previous Office Action and herein referred to 
as Sundaramoorthy) in view of Hennessy and Patterson, "Computer Architecture - A
Quantitative Approach, 2nd Edition," 1996 (as applied in the previous Office Action and herein
referred to as Hennessy). 

7. Referring to claim 1, Sundaramoorthy has taught an apparatus comprising: 
a) a first processor and a second processor. See Fig. 1 and note the R-stream and A-stream
processors. 

b) a plurality of memory devices coupled to the first processor and the second processor. See
Fig. 1 and note the I-cache and D-cache memories. 

c) a first buffer coupled to the first processor and the second processor, the first buffer being a 
register buffer. See Fig. 1 and also see column 10, lines 17-21, and lines 35-38 and note that data
operands are passed from the A-stream processor to the R-stream processor via the delay buffer. 

d) a second buffer coupled to the first processor and the second processor, the second buffer 
being a trace buffer. See column 7, and note the bulleted paragraph
beginning with "Conventional Fetching. . .". In this paragraph a trace predictor/buffer is
disclosed which stores/buffers trace IDs. 

e) a plurality of memory instruction buffers coupled to the first processor and the second 
processor. Note from Fig. 1 that separate reorder buffers are connected to each processor.

f) wherein the first processor and the second processor perform single threaded applications 
using multithreading resources (col. 1, lines 53-54, col. 2, lines 18-20: teaches that a single 
thread is instantiated twice to create two threads and each thread is run on different processors). 

g) the first processor executes a single threaded application ahead of the second processor 
executing said single threaded application to avoid misprediction. See column 1, 2nd paragraph,
and note one is executed ahead of the other so that control outcomes may be passed to the 
lagging thread. Also, see column 2, lines 37-43 and note that the R-stream receives accurate 
predictions. Hence, branch mispredictions are avoided. 
h) Sundaramoorthy has not taught that the first and second processors each have a scoreboard
and a decoder. 

However, Official Notice is taken that instruction decoders are well known and expected 
in the art. More specifically, after instructions are fetched by a processor, they must inherently 
be decoded so that the processor may determine what type of instruction has been fetched and 
consequently, what operation to perform. Clearly, if both processors of Sundaramoorthy are 
fetching instructions, then both sets must be decoded. As a result, it would have been obvious to 
have instruction decoders in each of the first and second processors so that instructions may be 
decoded. One would have been motivated to make such a modification to allow both processors 
to decode their own instructions. 

In addition, Hennessy has taught that a scoreboard allows instructions to execute out of 
order. As is known in the art, out-of-order execution is advantageous because it allows 
instructions to execute as soon as their resources are ready, thereby reducing stalling and CPU 
idleness. See pages 241 and 242. As a result, in order to allow both processors to benefit from 
such execution and resulting advantages, it would have been obvious to one of ordinary skill in 
the art at the time of the invention to modify each of the first and second processors of 
Sundaramoorthy to include scoreboards. 

8. Referring to claim 2, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 1. Sundaramoorthy has further taught that the memory devices comprise a 
plurality of cache devices (Fig. 1, I-Cache and D-Cache).

9. Referring to claim 3, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 1. Sundaramoorthy has further taught that the first processor is coupled to at
least one of a plurality of zero level (L0) data cache devices and at least one of a plurality of L0
instruction cache devices, and the second processor is coupled to at least one of the plurality of
L0 data cache devices and at least one of the plurality of L0 instruction cache devices (fig. 1
shows that each processor is connected to a separate data cache (D-Cache) and instruction cache
(I-Cache) which can be considered as zero-level caches because they are directly connected to the
execute cores).

10. Referring to claim 4, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 3. Sundaramoorthy has further taught that each of the plurality of L0 data 
cache devices store exact copies of store instruction data. Although this is not mentioned 
explicitly, it is deemed inherent to the design because as each processor is executing the same 
thread (col. 1, lines 53-54, col. 2, lines 18-20) the data caches in each processor must contain 
exact copies of data. And, this data is store instruction data because data that is stored to main 
memory is also stored in a data cache. 

11. Referring to claim 5, Sundaramoorthy in view of Hennessy has taught an apparatus as
described in claim 1. Sundaramoorthy has further taught that the plurality of memory
instruction buffers includes at least one store forwarding buffer (fig. 1, reorder buffer connected 
to A-stream processor) and at least one load-ordering buffer (fig. 1, reorder buffer connected to 
R-stream processor). 

12. Referring to claim 6, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 5. Although Sundaramoorthy in view of Hennessy does not mention that the 
at least one store forwarding buffer (fig. 1, reorder buffer (ROB) connected to A-stream 
processor) comprises a structure having a plurality of entries, each of the plurality of entries 
having a tag portion, a validity portion, a data portion, a store instruction identification (ID)
portion, and a thread ID portion, it is deemed inherent to the design. A ROB is used to order
instructions completing execution and hence must contain a plurality of entries. Also each entry must
have a tag portion to index into the ROB, a validity portion to indicate whether an entry can be
written to or read from, a data portion for storing the results of the instruction, a store instruction
ID portion, which would be the instruction opcode of an entry, and a thread ID for indicating which
thread that instruction belongs to. 

13. Referring to claim 7, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 6. Although Sundaramoorthy in view of Hennessy does not mention that the 
at least one load ordering buffer (fig. 1, reorder buffer connected to R-stream processor) 
comprises a structure having a plurality of entries, each of the plurality of entries having a tag 
portion, an entry validity portion, a load identification (ID) portion, and a load thread ID portion,
it is deemed inherent to the design. A ROB is used to order instructions completing execution
and hence must contain a plurality of entries. Also each entry must have a tag portion to index into
the ROB, a validity portion to indicate whether an entry can be written to or read from, a load
instruction ID portion, which would be the instruction opcode of an entry, and a thread ID for indicating
which thread that instruction belongs to. 

14. Referring to claim 9, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 1. Furthermore, although Sundaramoorthy has taught that the trace buffer
(delay buffer) is a FIFO queue (col. 10, line 17), they do not disclose that the trace buffer is a 
circular buffer having an array with head and tail pointers, the head and tail pointers having a 
wrap-around bit. However, "Official Notice" is taken that it is well known and expected in the 
art to implement a FIFO queue as a circular buffer with head and tail pointers wherein head and
tail pointers have a wrap-around bit. A circular buffer is useful to implement in hardware
because only the head and tail pointers need to be incremented/decremented instead of actually
physically shifting entries. A wrap-around bit would also be needed to indicate whether the
pointer has wrapped around the end of the queue. Therefore, it would have been obvious to one of
ordinary skill in the art at the time of the invention to have implemented the FIFO queue as a 
circular buffer with head and tail pointers, the head and tail pointers having a wrap around bit 
because it is known that a FIFO queue can be implemented as a circular buffer and it is easier to 
build in hardware. 
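
For illustration only, the circular-buffer arrangement described in this paragraph can be sketched as follows. The sketch is a software model under stated assumptions, not a circuit taken from Sundaramoorthy or from the application: the class name CircularFifo and its fields are hypothetical, and each pointer is modeled as an index plus a single wrap-around bit that toggles whenever the index passes the end of the array. Equal indices with equal wrap bits indicate an empty queue; equal indices with differing wrap bits indicate a full queue.

```python
class CircularFifo:
    """Hypothetical model of a FIFO built as a circular buffer.

    Head and tail are array indices; each carries one wrap-around bit
    that flips every time the index wraps past the end of the array.
    """

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size
        self.head = 0        # index of the oldest entry
        self.head_wrap = 0   # wrap-around bit for the head pointer
        self.tail = 0        # index of the next free slot
        self.tail_wrap = 0   # wrap-around bit for the tail pointer

    def is_empty(self):
        # Same index and same wrap bit: the tail has not lapped the head.
        return self.head == self.tail and self.head_wrap == self.tail_wrap

    def is_full(self):
        # Same index but different wrap bit: the tail is one lap ahead.
        return self.head == self.tail and self.head_wrap != self.tail_wrap

    def push(self, item):
        if self.is_full():
            raise OverflowError("queue is full")
        self.slots[self.tail] = item
        self.tail += 1
        if self.tail == self.size:   # wrapped past the end of the array
            self.tail = 0
            self.tail_wrap ^= 1

    def pop(self):
        if self.is_empty():
            raise IndexError("queue is empty")
        item = self.slots[self.head]
        self.head += 1
        if self.head == self.size:   # wrapped past the end of the array
            self.head = 0
            self.head_wrap ^= 1
        return item
```

No entry is ever shifted; only the two pointers move, which is the property the paragraph relies on. The wrap-around bits are what allow a full queue to be told apart from an empty one when the two indices coincide.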

15. Referring to claim 10, Sundaramoorthy in view of Hennessy has taught an apparatus as 
described in claim 1. Sundaramoorthy in view of Hennessy has not explicitly taught that the
register buffer comprises an integer register buffer and a predicate register buffer. However,
Official Notice is taken that integer registers and predicate registers are well known and expected 
in the art. By implementing integer registers, the system will be able to load and store integer 
data and perform integer operations quickly. Furthermore, by implementing predicate registers, 
the system will be able to achieve conditional execution of instructions without conditional 
branch instructions. Consequently, to achieve such functionality, it would have been obvious to 
one of ordinary skill in the art at the time of the invention to modify Sundaramoorthy in view of 
Hennessy to include an integer register buffer and a predicate register buffer in the register buffer
(delay buffer). 

16. Referring to claim 29, Sundaramoorthy has taught a system comprising: 
a) a first processor (fig. 1, R-stream processor comprising the execute core).

b) a second processor (fig. 1, A-stream processor comprising the execute core).

c) a bus coupled to the first processor and the second processor (fig. 1, a bus is shown between 
the first and second processors via the delay buffer); 

d) a plurality of local memory devices coupled to the first processor and the second processor 
(fig. 1, I-cache and D-cache memories);

e) a first buffer coupled to the first processor and the second processor, the first buffer being a 
register buffer. See Fig. 1 and also see column 10, lines 17-21, and lines 35-38 and note that data
operands are passed from the A-stream processor to the R-stream processor via the delay buffer. 

d) a second buffer coupled to the first processor and the second processor, the second buffer 
being a trace buffer. See column 7, and note the bulleted paragraph
beginning with "Conventional Fetching. . .". In this paragraph a trace predictor/buffer is
disclosed which stores/buffers trace IDs. 

e) a plurality of memory instruction buffers coupled to the first processor and the second 
processor. Note from Fig. 1 that separate reorder buffers are connected to each processor.

f) wherein the first processor and the second processor perform single threaded applications 
using multithreading resources (col. 1, lines 53-54, col. 2, lines 18-20: teaches that a single 
thread is instantiated twice to create two threads and each thread is run on different processors). 

g) the first processor executes a single threaded application ahead of the second processor 
executing said single threaded application to avoid misprediction. See column 1, 2nd paragraph,
and note one is executed ahead of the other so that control outcomes may be passed to the 
lagging thread. Also, see column 2, lines 37-43 and note that the R-stream receives accurate 
predictions. Hence, branch mispredictions are avoided. 




h) Sundaramoorthy has not taught that the first and second processors each have a scoreboard 
and a decoder. 

However, Official Notice is taken that instruction decoders are well known and expected 
in the art. More specifically, after instructions are fetched by a processor, they must inherently 
be decoded so that the processor may determine what type of instruction has been fetched and 
consequently, what operation to perform. Clearly, if both processors of Sundaramoorthy are 
fetching instructions, then both sets must be decoded. As a result, it would have been obvious to 
have instruction decoders in each of the first and second processors so that instructions may be 
decoded. One would have been motivated to make such a modification to allow both processors 
to decode their own instructions. 

In addition, Hennessy has taught that a scoreboard allows instructions to execute out of 
order. As is known in the art, out-of-order execution is advantageous because it allows 
instructions to execute as soon as their resources are ready, thereby reducing stalling and CPU 
idleness. See pages 241 and 242. As a result, in order to allow both processors to benefit from 
such execution and resulting advantages, it would have been obvious to one of ordinary skill in 
the art at the time of the invention to modify each of the first and second processors of 
Sundaramoorthy to include scoreboards. 

i) Sundaramoorthy also has not taught a main memory coupled to the bus. However, Official 
Notice is taken that it is well known and expected in the art to have a main memory connected to 
multiple processors via a common bus in a multi-processor environment. Since caches do not 
store every instruction and data item, main memory must exist to store all of it. Therefore it 
would have been obvious to one of ordinary skill in the art at the time of the invention to have
added a main memory coupled to the bus in the Sundaramoorthy reference. 

17. Referring to claim 30, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 29, wherein the memory devices comprise a plurality of cache devices (Fig.
1, I-Cache and D-Cache).

18. Referring to claim 31, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 29, wherein the first processor is coupled to at least one of a plurality of zero 
level (L0) data cache devices and at least one of a plurality of L0 instruction cache devices, and 
the second processor is coupled to at least one of the plurality of L0 data cache devices and at 
least one of the plurality of L0 instruction cache devices (fig. 1 shows that each processor is 
connected to a separate data cache (D-Cache) and instruction cache (I-Cache) which can be considered
as zero-level caches because they are directly connected to the execute cores). 

19. Referring to claim 32, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 31, wherein each of the plurality of L0 data cache devices store exact copies 
of store instruction data. Although this is not mentioned explicitly, it is deemed inherent to the 
design because as each processor is executing the same thread (col. 1, lines 53-54, col. 2, lines 
18-20) the instruction and data caches in each processor must contain exact copies of instructions 
and data. And, this data is store instruction data because data that is stored to main memory is 
also stored in a data cache. 

20. Referring to claim 33, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 31. Sundaramoorthy in view of Hennessy has not taught that the first
processor and the second processor each share a first level (L1) cache device and a second
level (L2) cache device. However, Official Notice is taken that it is well known and expected in
the art that processors in a multi-processor environment share L1 and L2 cache devices. Such a
scheme allows for the simplification of cache coherency in that both processors would be able to 
access the same up-to-date cache as opposed to one of the processors accessing out-of-date 
information in its own cache. Therefore, it would have been obvious to one of ordinary skill in 
the art at the time of the invention to have the first and second processors share L1 and L2 cache
devices. 

21. Referring to claim 34, Sundaramoorthy in view of Hennessy has taught a system as
described in claim 29. Sundaramoorthy has further taught that the plurality of memory 
instruction buffers includes at least one store forwarding buffer (fig. 1, reorder buffer connected 
to A-stream processor) and at least one load-ordering buffer (fig. 1, reorder buffer connected to 
R-stream processor). 

22. Referring to claim 35, Sundaramoorthy in view of Hennessy has taught a system as 
described in claim 34. Although Sundaramoorthy in view of Hennessy does not mention that the at least one
store forwarding buffer (fig. 1, reorder buffer (ROB) connected to A-stream processor)
comprises a structure having a plurality of entries, each of the plurality of entries having a tag
portion, a validity portion, a data portion, a store instruction identification (ID) portion, and a
thread ID portion, it is deemed inherent to the design. A ROB is used to order instructions
completing execution hence must contain a plurality of entries. Also each entry must have a tag 
portion to index into the ROB, a validity portion to indicate whether an entry can be written to or 
read from, a data portion for storing the results of the instruction, a store instruction ID portion,
which would be the instruction opcode of an entry, and a thread ID for indicating which thread that
instruction belongs to. 

23. Claims 11, 13, and 15-19 are rejected under 35 U.S.C. 103(a) as being unpatentable over
Sundaramoorthy in view of Hennessy, as applied above, and further in view of Akkary, WO
99/31594 (as applied in the previous Office Action).

24. Referring to claim 11, Sundaramoorthy has taught a method comprising:

a) executing a plurality of instructions in a first thread by a first processor (col. 1, lines 53-54, 
col. 2, lines 18-23: The R-stream thread is executed by the R-stream processor in fig. 1). 

b) executing said plurality of instructions in the first thread by a second processor (col. 1, lines 
53-54, col. 2, lines 18-32: The A-stream thread, which is the same as the R-stream thread, is 
executed by the A-stream processor) as directed by the first processor (col. 4, lines 21-38: IR-detector
and IR-predictor in fig. 1, which are part of the first processor, i.e., R-stream processor,
direct the second processor (A-stream processor) to execute instructions from the A-stream), the
second processor executing said plurality of instructions ahead of the first processor (col. 2, lines
20-23: A-stream runs ahead of the R-stream and it is executed by the second processor) to avoid
misprediction (see column 1, 2nd paragraph, and note one is executed ahead of the other so that
control outcomes may be passed to the lagging thread. Also, see column 2, lines 37-43 and note
that the R-stream receives accurate predictions. Hence, branch mispredictions are avoided).
Note that even though fewer instructions may be executed by the first processor (due to removal),
all of the instructions executed by the first processor will also be executed by the second
processor.
c) Sundaramoorthy has not taught tracking at least one register that is one of loaded from a
register file buffer, and written by said second processor, said tracking executed by said second 
processor. However, Hennessy has taught the idea of a scoreboard which allows instructions to 
execute out of order. As is known in the art, out-of-order execution is advantageous because it 
allows instructions to execute as soon as their resources are ready, thereby reducing stalling and 
CPU idleness. See pages 241 and 242. As a result, in order to allow the second processor to 
benefit from such execution and resulting advantages, it would have been obvious to one of 
ordinary skill in the art at the time of the invention to modify the second processor of 
Sundaramoorthy to include a scoreboard. And, the inherent nature of a scoreboard is to track 
registers written by the second processor. See Fig. 4.4 on page 247, and note that the system
tracks when registers are ready so that execution may continue. For registers to be ready, it must 
be tracked when the writing to those registers completes. 
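
For illustration only, the register-tracking behavior attributed to a scoreboard in this paragraph can be sketched as follows. This is a deliberately reduced model, not the full scoreboard of Hennessy (which also tracks functional-unit status): it keeps one pending-write flag per register, and the class name RegisterScoreboard and its method names are hypothetical.

```python
class RegisterScoreboard:
    """Hypothetical, reduced scoreboard: one pending-write flag per register."""

    def __init__(self, num_regs):
        # pending[r] is True while some issued instruction has yet to write r.
        self.pending = [False] * num_regs

    def can_issue(self, sources, dest):
        # An instruction may proceed only when none of its source registers
        # is awaiting a write and its destination is not already claimed.
        sources_ready = not any(self.pending[r] for r in sources)
        return sources_ready and not self.pending[dest]

    def issue(self, dest):
        # Record that the destination register now has a write outstanding.
        self.pending[dest] = True

    def write_back(self, dest):
        # The write has completed, so the register is ready for readers.
        self.pending[dest] = False
```

In this model a register becomes "ready" exactly when write_back clears its flag, which corresponds to the observation above that the completion of each register write must be tracked.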

d) transmitting control flow information from the second processor to the first processor, the first 
processor avoiding branch prediction by receiving the control flow information. See column 1, 
2nd paragraph, column 2, lines 37-43, and column 11, line 5. Note that accurate control
information is sent to the R-stream so that predictions are not needed. The R-stream would 
instead know which way to go from predictions in the A-stream. 

e) transmitting results from the second processor to the first processor, the first processor 
avoiding executing a portion of instructions (col. 10, lines 17-21, 30-33, 35-38: results (data-flow 
information) are transmitted from the A-stream processor to the R-stream processor via the delay 
buffer, and these values are used directly by the instructions hence avoiding the execution of the 
portion of the instructions) by committing the results of the portion of instructions into a register 
file from a first buffer, the first buffer being a trace buffer (Although this is not explicitly
mentioned, it is deemed inherent to the design because col. 4 line 15 discloses the presence of a 
register file in the processor and as results are written to the register file so that they can be read 
from by future instructions, the results of the instructions from the trace buffer (delay buffer) 
must be written into the register file). 

f) Sundaramoorthy in view of Hennessy has not taught clearing a store validity bit and setting a 
mispredicted bit in a load entry in the first buffer if a replayed store instruction has a matching 
store identification (ID) portion in a second buffer, the second buffer being a load buffer.
However, Official Notice is taken that it is well known and expected in the art to use load and 
store buffers for the proper handling of memory operations. Akkary discloses a system for 
ordering loads and stores in a multithreaded processor using load and store buffers (fig. 2, 
182,184). He discloses clearing a store validity bit (SB Hit field) in the load buffer if data came 
from memory (pg. 37, para. 3, line 4; pg. 38, line 1). Also when a store instruction is executed 
(which includes replayed stores), its address is compared with the store ID portion (addresses) of
load instructions (pg. 36, para. 3). On a match, a replay event is signaled to the load entry in the
trace buffer to replay the load instruction and all its dependent instructions because it was
mispredicted (pg. 38, para. 2). Furthermore, Official Notice is taken that it is well known and
expected in the art to set a status bit to indicate a misprediction. Clearly, in order to detect a
misprediction, some bit must change somewhere in the system. As shown in In re Larson, 144
USPQ 347 (CCPA 1965), to make integral is generally not given patentable weight or would 
have been an obvious improvement. That is, it does not matter where this misprediction bit is 
located within the system, as long as it exists. One of ordinary skill in the art would have 
recognized that one could use the load and store buffer arrangement of Akkary in the
Sundaramoorthy reference in order to handle loads and stores in the multithreaded environment.
Therefore, it would have been obvious to one of ordinary skill in the art at the time of the 
invention to have modified the Sundaramoorthy reference by clearing a store validity bit and 
setting a mispredicted bit in a load entry in the trace buffer (delay buffer) if a replayed store 
instruction has a matching store ID portion. 
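
For illustration only, the combination proposed in this paragraph can be sketched as follows. The sketch is an assumption-laden simplification and does not reproduce the structures of Akkary or of the application: the load buffer and the trace-buffer load entry are collapsed into a single hypothetical LoadEntry record, and the names LoadEntry and on_replayed_store are invented for this example.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    """Hypothetical load entry; field names follow the claim language."""
    tag: int                    # address tag of the load
    store_id: int               # ID of the store that supplied the data
    store_valid: bool = True    # set while the forwarded store data is trusted
    mispredicted: bool = False  # set when the load must be replayed


def on_replayed_store(load_buffer, store_id):
    """Mark every load that depended on the replayed store as mispredicted."""
    affected = []
    for entry in load_buffer:
        if entry.store_valid and entry.store_id == store_id:
            entry.store_valid = False   # clear the store validity bit
            entry.mispredicted = True   # set the mispredicted bit
            affected.append(entry)
    return affected
```

A load whose store ID does not match the replayed store is left untouched, so only the dependent loads are flagged for replay.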

25. Referring to claim 13, Sundaramoorthy in view of Hennessy and further in view of 
Akkary has taught a method as described in claim 11. Sundaramoorthy has further taught 
duplicating memory information in separate memory devices for independent access by the first 
processor and the second processor. Although this is not mentioned explicitly, it is deemed 
inherent to the design because as each processor is executing the same thread (col. 1, lines 53-54, 
col. 2, lines 18-20) the instruction and data caches in each processor (fig. 1) must contain exact 
copies of instructions and data. 

26. Referring to claim 15, Sundaramoorthy in view of Hennessy and further in view of
Akkary has taught a method as described in claim 11. Sundaramoorthy in view of Hennessy has 
not taught setting a store validity bit if a store instruction that is not replayed matches a store 
identification (ID) portion. However, Official Notice is taken that it is well known and expected 
in the art to use load and store buffers for the proper handling of memory operations. Akkary 
discloses a system for ordering loads and stores in a multithreaded processor using load and store 
buffers (fig. 2, 182,184). He discloses setting a store validity bit (SB Hit field) in the load buffer 
if data came from store buffer (pg. 37, para. 3, line 4; pg. 38, lines 1-2). In order for data to 
come from the store buffer, a store instruction address (including store instructions that are not 
replayed) must match a store ID portion (address) of the load entry. One of ordinary skill in the
art would have recognized that one could use the load and store buffer arrangement of Akkary in
the Sundaramoorthy reference in order to handle loads and stores in the multithreaded environment.
Therefore it would have been obvious to one of ordinary skill in the art at the time of the 
invention to have modified the Sundaramoorthy reference by setting a store validity bit if a store 
instruction that is not replayed matches the store ID portion. 

27. Referring to claim 16, Sundaramoorthy in view of Hennessy and further in view of 
Akkary has taught a method as described in claim 11. Furthermore, although Sundaramoorthy
has taught flushing the pipeline (reorder buffer) of the R-stream on a misprediction, 
Sundaramoorthy has not taught flushing a pipeline, setting a mispredicted bit in a load entry in 
the trace buffer and restarting a load instruction if one of the load is not replayed and does not 
match a tag portion in a load buffer, and the load instruction matches the tag portion in the load 
buffer while a store valid bit is not set. However, Official Notice is taken that it is well known 
and expected in the art to use load and store buffers for the proper handling of memory 
operations. Akkary discloses a system for ordering loads and stores in a multithreaded processor 
using load and store buffers (fig. 2, 182,184). In particular, when a store valid bit is not set (SB 
hit = 0, pg. 38, para. 2) and when a store instruction compared with the addresses of load 
instructions (pg. 36, para. 3) is a match, a replay event is signaled to the load entry in the trace 
buffer to replay the load instruction and all its dependent instructions because it was
mispredicted (pg. 38, para. 2). Furthermore, Official Notice is taken that it is well known and
expected in the art to set a status bit to indicate a misprediction. Clearly, in order to detect a
misprediction, some bit must change somewhere in the system. As shown in In re Larson, 144
USPQ 347 (CCPA 1965), to make integral is generally not given patentable weight or would
have been an obvious improvement. That is, it does not matter where this misprediction bit is 
located within the system, as long as it exists. One of ordinary skill in the art would have 
recognized that one could use the load and store buffer arrangement of Akkary in the 
Sundaramoorthy reference in order to handle loads and stores in the multithreaded environment and
flush the pipeline on reading the mispredicted bit. Therefore it would have been obvious to one
of ordinary skill in the art at the time of the invention to have modified the Sundaramoorthy
reference by flushing a pipeline, setting a mispredicted bit in a load entry in the trace buffer and 
restarting a load instruction if one of the load is not replayed and does not match a tag portion in 
a load buffer, and the load instruction matches the tag portion in the load buffer while a store 
valid bit is not set. 
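
For illustration only, the address comparison that the rejection above attributes to Akkary can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the design of either reference: the names LoadEntry, sb_hit, mispredicted, and check_store are hypothetical, and the load buffer is modeled as a simple list.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    """Hypothetical load buffer entry; field names are illustrative assumptions."""
    load_id: int
    address: int
    sb_hit: bool = False        # "store valid" bit: True if data came from the store buffer
    mispredicted: bool = False  # status bit set when the load must be replayed


def check_store(load_buffer, store_address):
    """Compare an executed store's address with every load entry.

    A load whose address matches while its sb_hit bit is not set is marked
    mispredicted, and its ID is returned so that a replay can be signaled.
    """
    to_replay = []
    for entry in load_buffer:
        if entry.address == store_address and not entry.sb_hit:
            entry.mispredicted = True
            to_replay.append(entry.load_id)
    return to_replay
```

In this sketch, a load that already obtained its data from the store buffer (sb_hit set) is left alone, while a matching load that read from memory is flagged for replay.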

28. Referring to claim 17, Sundaramoorthy in view of Hennessy and further in view of 
Akkary has taught a method as described in claim 11. Sundaramoorthy has further taught 
executing a replay mode at a first instruction of a speculative thread (col. 1, lines 53-54, col. 2, 
lines 18-20: This feature is deemed inherent to the reference because when the A-stream is 
initially started, i.e., at the first instruction, there will be two redundant threads being executed 
which means the thread is being replayed from that point. This can be called a replay mode). 

29. Referring to claim 18, Sundaramoorthy in view of Hennessy and further in view of 
Akkary has taught a method as described in claim 11. Sundaramoorthy has further taught: 

a) issuing all instructions up to a next replayed instruction including dependent instructions (This 
feature is deemed inherent to the design because in order to execute the thread all instructions are 
issued in either one of the R-stream and A-stream processors). 

b) issuing instructions that are not replayed as no-operation (NOPs) instructions (This feature is 
also deemed inherent to the design because if an instruction that is not replayed does not occupy 
a slot in the execution pipeline it will lead to improper functioning of the processor. Hence as the 
instruction that is not replayed is not to be executed, a NOP must be issued in its place). 

c) issuing all load instructions and store instructions to memory (This limitation is also deemed 
inherent to the design because if all loads and stores are not issued to memory, the state of the 
thread would be incorrect leading to the malfunctioning of the system). 

d) committing non-replayed instructions from the trace buffer to the register file (Although this is 
not explicitly mentioned, it is deemed inherent to the design because col. 4 line 15 discloses the 
presence of a register file in the processor and as results are written to the register file so that 
they can be read from by future instructions, the results of the instructions from the trace buffer 
(delay buffer) that are not going to be replayed must be written into the register file). 

e) Sundaramoorthy in view of Hennessy has not taught supplying names from the trace buffer to 
preclude register renaming. However, Hennessy and Patterson teach that register renaming is 
used to reduce name dependencies allowing instructions involved in name dependencies to 
execute simultaneously or be reordered (pg. 232, para. 5). As these dependencies are resolved, 
more instruction level parallelism can be extracted and performance can be improved. One of 
ordinary skill in the art would have recognized that register renaming could be used in the Sundaramoorthy 
reference because it too would improve performance. As the trace buffer (delay buffer) would 
also supply the names, it would be logical not to do the renaming again in the R-stream 
processor. Therefore, it would have been obvious to one of ordinary skill in the art at the time of 
the invention to have modified the Sundaramoorthy reference by adding register renaming 
capabilities and supplying names from the trace buffer to preclude register renaming. One would 
have been motivated to do so because it would improve performance which is one of the 
objectives of the Sundaramoorthy reference (col. 1, lines 29-36). 

30. Referring to claim 19, Sundaramoorthy in view of Hennessy and further in view of 
Akkary has taught a method as described in claim 11. Sundaramoorthy has further taught 
clearing a valid bit in an entry in a load buffer (fig. 1, the reorder buffer connected to the R- 
stream processor) if the load entry is retired (Although not explicitly mentioned, it is deemed 
inherent to the design because a load entry, on being retired, has to be marked invalid to ensure 
that other new instructions can occupy that entry safely). 

31. Claims 20-22 and 24-28 are rejected under 35 U.S.C. 103(a) as being unpatentable over 
Sundaramoorthy in view of Hennessy in view of Akkary, as applied above, and further in view 
of Tanenbaum, "Structured Computer Organization," Prentice-Hall, 1984, pp. 10-12 (as applied 
in the previous Office Action and herein referred to as Tanenbaum). 

32. Referring to claim 20, Sundaramoorthy has taught: 

a) executing a first thread from a first processor (col. 1, lines 53-54, col. 2, lines 18-23: The 
R-stream thread is executed by the R-stream processor in fig. 1). 

b) executing said first thread from a second processor (col. 1, lines 53-54, col. 2, lines 18-32: The 
A-stream thread, which is the same as the R-stream thread, is executed by the A-stream 
processor) as directed by the first processor (col. 4, lines 21-38: ER-detector and ER-predictor in 
fig. 1, considered part of the first processor i.e. R-stream processor, direct the second processor 
(A-stream processor) to execute instructions from the A-stream), the second processor executing 
instructions ahead of the first processor to avoid misprediction (See column 1, 2nd paragraph, 
and note one is executed ahead of the other so that control outcomes may be passed to the 
lagging thread. Also, see column 2, lines 37-43 and note that the R-stream receives accurate 
predictions. Hence, branch mispredictions are avoided.). 

c) Sundaramoorthy has not taught tracking at least one register that is one of loaded from a first 
buffer, and written by said second processor, said tracking executed by said second processor, 
the first buffer being a register file buffer. However, Hennessy has taught the idea of a 
scoreboard which allows instructions to execute out of order. As is known in the art, out-of- 
order execution is advantageous because it allows instructions to execute as soon as their 
resources are ready, thereby reducing stalling and CPU idleness. See pages 241 and 242. As a 
result, in order to allow the second processor to benefit from such execution and resulting 
advantages, it would have been obvious to one of ordinary skill in the art at the time of the 
invention to modify the second processor of Sundaramoorthy to include a scoreboard. And, the 
inherent nature of a scoreboard is to track registers written by the second processor. See Fig. 4.4 
on page 247, and note that the system tracks when registers are ready so that execution may 
continue. For registers to be ready, it must be tracked when the writing to those registers 
completes. 

d) Sundaramoorthy in view of Hennessy has not taught clearing a store validity bit and setting a 
mispredicted bit in a load entry in a second buffer if a replayed store instruction has a matching 
store identification (ID) portion, the second buffer being a trace buffer. However, Official 
Notice is taken that it is well known and expected in the art to use load and store buffers for the 
proper handling of memory operations. Akkary discloses a system for ordering loads and stores 
in a multithreaded processor using load and store buffers (fig. 2, 182,184). He discloses clearing 
a store validity bit (SB Hit field) in the load buffer if data came from memory (pg. 37, para. 3, 
line 4; pg. 38, line 1). Also when a store instruction is executed (which includes replayed stores), 
its address is compared with the store ID portion (addresses) of load instructions (pg. 36, para. 
3). On a match, a replay event is signaled to the load entry in the trace buffer to replay the load 
instruction and all its dependent instructions because it was mispredicted (pg. 38, para. 2). 
Furthermore, Official Notice is taken that it is well known and expected in the art to set a status bit 
to indicate a misprediction. Clearly, in order to detect a misprediction, some bit must change 
somewhere in the system. As shown in In re Larson, 144 USPQ 347 (CCPA 1965), to make 
integral is generally not given patentable weight or would have been an obvious improvement. 
That is, it does not matter where this misprediction bit is located within the system, as long as it 
exists. One of ordinary skill in the art would have recognized that one could use the load and 
store buffer arrangement of Akkary in the Sundaramoorthy reference in order to handle loads and 
stores in the multithreaded environment. Therefore, it would have been obvious to one of 
ordinary skill in the art at the time of the invention to have modified the Sundaramoorthy 
reference by clearing a store validity bit and setting a mispredicted bit in a load entry in the trace 
buffer (delay buffer) if a replayed store instruction has a matching store ID portion. 

e) Sundaramoorthy in view of Hennessy has not taught an apparatus comprising a machine-readable 
medium containing instructions which, when executed by a machine, perform the 
aforementioned operations. However, Tanenbaum has taught that any instruction executed by 
hardware can also be simulated in software (pg. 11, para. 4, lines 1-2). He also teaches that 
hardware is generally immutable (first para. after sec. 1.4 header) while software allows for more 
rapid change (pg. 11, para. 4, lines 2-4). One of ordinary skill in the art at the time of the 
invention would have been motivated to convert the Sundaramoorthy reference to software i.e. 
instructions on a machine readable medium because Tanenbaum teaches that hardware is 
generally immutable (first para. after sec. 1.4 header) while software allows for more rapid 
change (pg. 11, para. 4, lines 2-4). Therefore, to allow for ease of correction of mistakes, and/or 
an ease of addition of new functionality, it would have been obvious to one of ordinary skill in 
the art to have implemented the method of Sundaramoorthy by an apparatus comprising 
instructions recorded on a machine readable medium. 
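
For illustration only, the register tracking that item c) above attributes to a scoreboard can be sketched in Python as follows. This simplified model is an assumption made for clarity: it tracks only which registers have a pending write and omits the functional-unit bookkeeping of a full scoreboard. The class and method names are hypothetical and are not taken from any cited reference.

```python
class Scoreboard:
    """Simplified, hypothetical scoreboard that tracks registers awaiting a write."""

    def __init__(self):
        # Registers whose write has been issued but has not yet completed.
        self.pending = set()

    def can_issue(self, dest, sources):
        """An instruction may issue only if none of its registers has a pending write."""
        return dest not in self.pending and not (set(sources) & self.pending)

    def issue(self, dest):
        """Record that a write to dest is now in flight."""
        self.pending.add(dest)

    def write_back(self, dest):
        """Record that the write to dest has completed, so the register is ready."""
        self.pending.discard(dest)
```

In this sketch, an instruction that reads a register stalls until write_back has been called for that register, which corresponds to tracking when the writing to that register completes.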

33. Referring to claim 21, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught an apparatus as described in claim 20. Sundaramoorthy 
has further taught transmitting control flow information from the second processor to the first 
processor, the first processor avoiding branch prediction by receiving the control flow 
information (col 10, lines 17-21, 30-35, 43-46). 

34. Referring to claim 22, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy 
has further taught duplicating memory information in separate memory devices for independent 
access by the first processor and the second processor (Although this is not mentioned explicitly, 
it is deemed inherent to the design because as each processor is executing the same thread (col. 
1, lines 53-54, col. 2, lines 18-20) the instruction and data caches in each processor (fig. 1) must 
contain exact copies of instructions and data). 

35. Referring to claims 24-25, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught an apparatus as described in claim 21. Furthermore, 
claims 24-25 are rejected for the same reasons set forth in the rejections of claims 15-16, 
respectively. 

36. Referring to claim 26, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy 
has further taught: 

a) executing a replay mode at a first instruction of a speculative thread (col. 1, lines 53-54, col. 2, 
lines 18-20: This feature is deemed inherent to the reference because when the A-stream is 
initially started i.e. at the first instruction, there will be two redundant threads being executed 
which means the thread is being replayed from that point. This can be called a replay mode). 

b) terminating the replay mode and the execution of the speculative thread if a partition in the 
second buffer is approaching an empty state (this limitation is also deemed inherent to the 
reference because when the partition in the trace buffer (delay buffer col. 10 lines 15+) is 
approaching an empty state that means the A-stream has stopped producing results and finished 
executing. Therefore now the replay mode and the A-stream are terminated). 

37. Referring to claim 27, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught a method as described in claim 21. Sundaramoorthy 
has further taught: 

a) issuing all instructions up to a next replayed instruction including dependent instructions (This 
feature is deemed inherent to the design because in order to execute the thread all instructions are 
issued in either one of the R-stream and A-stream processors). 

b) issuing instructions that are not replayed as no-operation (NOPs) instructions (This feature is 
also deemed inherent to the design because if an instruction that is not replayed does not occupy 
a slot in the execution pipeline it will lead to improper functioning of the processor. Hence as the 
instruction that is not replayed is not to be executed, a NOP must be issued in its place). 

c) issuing all load instructions and store instructions to memory (This limitation is also deemed 
inherent to the design because if all loads and stores are not issued to memory, the state of the 
thread would be incorrect leading to the malfunctioning of the system). 

d) committing non-replayed instructions from the second buffer to a register file (Although this 
is not explicitly mentioned, it is deemed inherent to the design because col. 4 line 15 discloses 
the presence of a register file in the processor and as results are written to the register file so that 
they can be read from by future instructions, the results of the instructions from the trace buffer 
(delay buffer) that are not going to be replayed must be written into the register file). 

e) Sundaramoorthy in view of Hennessy has not taught supplying names from the second buffer 
to preclude register renaming. However, Hennessy and Patterson teach that register renaming is 
used to reduce name dependencies allowing instructions involved in name dependencies to 
execute simultaneously or be reordered (pg. 232, para. 5). As these dependencies are resolved, 
more instruction level parallelism can be extracted and performance can be improved. One of 
ordinary skill in the art would have recognized that register renaming could be used in the Sundaramoorthy 
reference because it too would improve performance. As the trace buffer (delay buffer) would 
also supply the names, it would be logical not to do the renaming again in the R-stream 
processor. Therefore, it would have been obvious to one of ordinary skill in the art at the time of 
the invention to have modified the Sundaramoorthy reference by adding register renaming 
capabilities and supplying names from the trace buffer to preclude register renaming. One would 
have been motivated to do so because it would improve performance which is one of the 
objectives of the Sundaramoorthy reference (col. 1, lines 29-36). 

38. Referring to claim 28, Sundaramoorthy in view of Hennessy in view of Akkary and 
further in view of Tanenbaum has taught an apparatus as described in claim 21. Sundaramoorthy 
has further taught clearing a valid bit in an entry in a load buffer (fig. 1, the reorder buffer 
connected to the R-stream processor) if the load entry is retired (Although not explicitly 
mentioned, it is deemed inherent to the design because a load entry, on being retired, has to be 
marked invalid to ensure that other new instructions can occupy that entry safely). 

Response to Arguments 

39. Applicant's arguments filed on November 23, 2004, have been fully considered but they 
are not persuasive. 

40. Applicant argues the novelty/rejection of claim 1 on pages 12-13 of the remarks, in 
substance that: 

"Sundaramoorthy discloses a multiprocessor system that executes pseudo-redundant programs 
on separate processors on the same chip. The redundant programs, however, have different 
amounts of instructions. That is, one of the programs has more instructions than the other. 
(Sundaramoorthy page 258, first column, lines 40-55). And, both programs run in parallel on two 
processors. That is, the same thread is run twice, where one thread is executed in advance of the 
other. Since the same thread is ran, it is obvious that both the first processor and the second 
processor execute the same amount of instructions." 

41. These arguments are not found persuasive for the following reasons: 

a) Looking at the first paragraph in the first column (abstract), the examiner asserts that 
Sundaramoorthy does execute the same thread on both processors. Applicant is correct in saying 
that the redundant thread may be shorter, but the reason is that, by executing the 
advanced thread earlier, instructions from the redundant thread which serve no useful purpose 
may be removed, thereby speeding up execution. Consequently, as explained in the abstract, the 
removal of these useless instructions shortens the lagging thread while still achieving the same 
result, i.e., the threads are equivalent. Furthermore, applicant does not claim executing the same 
exact number of instructions. Applicant merely claims that the same thread is executed on both 
processors. It is the examiner's view that two threads are the same if the results are the same, 
which is the case according to Sundaramoorthy. 

42. Applicant also made similar arguments for the remaining independent claims. 
Consequently, the examiner asserts that the response above also applies to these arguments. 

Conclusion 

43. The prior art made of record and not relied upon is considered pertinent to applicant's 
disclosure. Applicant is reminded that in amending in response to a rejection of claims, the 
patentable novelty must be clearly shown in view of the state of the art disclosed by the 
references cited and the objections made. Applicant must also show how the amendments avoid 
such references and objections. See 37 CFR § 1.111(c). 

Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in 
Microprocessors," 1999, has taught redundant execution of threads on a single processor where 
one is executed ahead of the other in order to share control information and operand information. 
A trace buffer is also disclosed. 

Mukherjee, U.S. Patent No. 6,757,811, has taught slack fetch to improve performance in 
a simultaneous and redundantly threaded processor. 

Any inquiry concerning this communication or earlier communications from the 
examiner should be directed to David J. Huisman whose telephone number is (571) 272-4168. 
The examiner can normally be reached on Monday-Friday (8:00-4:30). 

If attempts to reach the examiner by telephone are unsuccessful, the examiner's 
supervisor, Eddie Chan can be reached on (571) 272-4162. The fax phone number for the 
organization where this application or proceeding is assigned is 703-872-9306. 

Information regarding the status of an application may be obtained from the Patent 
Application Information Retrieval (PAIR) system. Status information for published applications 
may be obtained from either Private PAIR or Public PAIR. Status information for unpublished 
applications is available through Private PAIR only. For more information about the PAIR 
system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR 
system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 